import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
kb = pd.read_csv('./data.csv.zip')
Kobe Bryant marked his retirement from the NBA by scoring 60 points in his final game as a Los Angeles Laker on Wednesday, April 13, 2016. Drafted into the NBA at the age of 17, Kobe earned the sport’s highest accolades throughout his long career.
Using 20 years of data on Kobe's swishes and misses, can you predict which shots will find the bottom of the net? This competition is well suited for practicing classification basics, feature engineering, and time series analysis. Practice got Kobe an eight-figure contract and 5 championship rings. What will it get you?
Acknowledgements
Kaggle is hosting this competition for the data science community to use for fun and education. For more data on Kobe and other NBA greats, visit stats.nba.com.
This data contains the location and circumstances of every field goal Kobe Bryant attempted during his 20-year career. Your task is to predict whether the basket went in (shot_made_flag).
We have removed 5000 of the shot_made_flags (represented as missing values in the csv file). These are the test set shots for which you must submit a prediction. You are provided a sample submission file with the correct shot_ids needed for a valid prediction.
To avoid leakage, your method should only train on events that occurred prior to the shot for which you are predicting! Since this is a playground competition with public answers, it's up to you to abide by this rule.
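One lightweight way to honor this no-peeking rule (purely a sketch; the analysis below does not enforce it) is scikit-learn's `TimeSeriesSplit`, which only ever validates on samples that come after the training fold:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for chronologically ordered shot data
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    # Every validation index comes strictly after every training index,
    # so the model never "sees the future" during training
    assert train_idx.max() < val_idx.min()
```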
For each missing shot_made_flag in the data set, you should predict a probability that Kobe made the field goal. The file should have a header and the following format:
shot_id,shot_made_flag
1,0.5
8,0.5
17,0.5
Submissions are evaluated with the log loss cost function.
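For reference, the metric can be computed with scikit-learn's `log_loss`; a sketch on toy predictions, not competition data:

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]          # actual shot_made_flag values
y_prob = [0.9, 0.2, 0.7, 0.6]  # predicted probabilities of a make

# Log loss heavily penalizes confident wrong answers; an uninformative
# constant prediction of 0.5 everywhere scores ln(2) ~= 0.693
print(log_loss(y_true, y_prob))
```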
kb.head()
| action_type | combined_shot_type | game_event_id | game_id | lat | loc_x | loc_y | lon | minutes_remaining | period | ... | shot_type | shot_zone_area | shot_zone_basic | shot_zone_range | team_id | team_name | game_date | matchup | opponent | shot_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jump Shot | Jump Shot | 10 | 20000012 | 33.9723 | 167 | 72 | -118.1028 | 10 | 1 | ... | 2PT Field Goal | Right Side(R) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 2000-10-31 | LAL @ POR | POR | 1 |
| 1 | Jump Shot | Jump Shot | 12 | 20000012 | 34.0443 | -157 | 0 | -118.4268 | 10 | 1 | ... | 2PT Field Goal | Left Side(L) | Mid-Range | 8-16 ft. | 1610612747 | Los Angeles Lakers | 2000-10-31 | LAL @ POR | POR | 2 |
| 2 | Jump Shot | Jump Shot | 35 | 20000012 | 33.9093 | -101 | 135 | -118.3708 | 7 | 1 | ... | 2PT Field Goal | Left Side Center(LC) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 2000-10-31 | LAL @ POR | POR | 3 |
| 3 | Jump Shot | Jump Shot | 43 | 20000012 | 33.8693 | 138 | 175 | -118.1318 | 6 | 1 | ... | 2PT Field Goal | Right Side Center(RC) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 2000-10-31 | LAL @ POR | POR | 4 |
| 4 | Driving Dunk Shot | Dunk | 155 | 20000012 | 34.0443 | 0 | 0 | -118.2698 | 6 | 2 | ... | 2PT Field Goal | Center(C) | Restricted Area | Less Than 8 ft. | 1610612747 | Los Angeles Lakers | 2000-10-31 | LAL @ POR | POR | 5 |
5 rows × 25 columns
kb.shape
(30697, 25)
From the looks of the table above, we have some locational data, a fair number of categorical features, and several id-based features that are unlikely to be useful later on. We also have a good number of total observations (not big data, but small-data constraints are unlikely to be an issue).
kb.isna().sum().sum()
5000
kb.isna().sum().sort_values(ascending = False).head(2)
shot_made_flag    5000
action_type          0
dtype: int64
We have 5000 missing values total; as stated in the Kaggle description, this is our test set, so we'll be sure to extract them.
In other words, we will not have any missing values once the split occurs.
kb.describe()
| game_event_id | game_id | lat | loc_x | loc_y | lon | minutes_remaining | period | playoffs | seconds_remaining | shot_distance | shot_made_flag | team_id | shot_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 30697.000000 | 3.069700e+04 | 30697.000000 | 30697.000000 | 30697.000000 | 30697.000000 | 30697.000000 | 30697.000000 | 30697.000000 | 30697.000000 | 30697.000000 | 25697.000000 | 3.069700e+04 | 30697.000000 |
| mean | 249.190800 | 2.476407e+07 | 33.953192 | 7.110499 | 91.107535 | -118.262690 | 4.885624 | 2.519432 | 0.146562 | 28.365085 | 13.437437 | 0.446161 | 1.610613e+09 | 15349.000000 |
| std | 150.003712 | 7.755175e+06 | 0.087791 | 110.124578 | 87.791361 | 0.110125 | 3.449897 | 1.153665 | 0.353674 | 17.478949 | 9.374189 | 0.497103 | 0.000000e+00 | 8861.604943 |
| min | 2.000000 | 2.000001e+07 | 33.253300 | -250.000000 | -44.000000 | -118.519800 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.610613e+09 | 1.000000 |
| 25% | 110.000000 | 2.050008e+07 | 33.884300 | -68.000000 | 4.000000 | -118.337800 | 2.000000 | 1.000000 | 0.000000 | 13.000000 | 5.000000 | 0.000000 | 1.610613e+09 | 7675.000000 |
| 50% | 253.000000 | 2.090035e+07 | 33.970300 | 0.000000 | 74.000000 | -118.269800 | 5.000000 | 3.000000 | 0.000000 | 28.000000 | 15.000000 | 0.000000 | 1.610613e+09 | 15349.000000 |
| 75% | 368.000000 | 2.960047e+07 | 34.040300 | 95.000000 | 160.000000 | -118.174800 | 8.000000 | 3.000000 | 0.000000 | 43.000000 | 21.000000 | 1.000000 | 1.610613e+09 | 23023.000000 |
| max | 659.000000 | 4.990009e+07 | 34.088300 | 248.000000 | 791.000000 | -118.021800 | 11.000000 | 7.000000 | 1.000000 | 59.000000 | 79.000000 | 1.000000 | 1.610613e+09 | 30697.000000 |
We're looking at the numeric features to find any potential outliers. This also gives us a good look at them in general. Most don't provide much relevant information; the id-based features pollute this section, and there are a few binary features that should be considered as categorical.
An interesting result to note: the shot_distance attribute has a minimum value of 0. This could realistically happen (dunks and the like), but I want to make sure these zeros aren't stand-ins for missing values.
## Create array of Laker colors
cols = ['#FDB927','#552583']
## Transform into palette
pal = sns.color_palette(cols)
# Configure background image and figure size
plt.figure(figsize = (20,20))
img = plt.imread('bb_court.jpg')
plt.imshow(img, extent=[33.07,34.17,-118.59,-117.95])
plt.axis('off')
# plot data
sns.scatterplot(data = kb, x = 'lat', y = 'lon', hue = 'shot_made_flag', alpha = 0.8, palette = pal);
#plt.savefig('./overview.png')
This is the career of a professional basketball player visualized.
Purely based on observation, it seems that Kobe took the majority of his shots from within the three-point line. It also looks like there is no clear relationship between his successful shots and where he was along the extended free throw line (i.e., his location along the y-axis did not seem to matter).
kb.shot_made_flag.value_counts()
0.0    14232
1.0    11465
Name: shot_made_flag, dtype: int64
Ignoring the observations in the test set, it seems that Kobe missed slightly more shots than he made, averaging around a 44.6% shot completion rate. As is expected, the target feature calls for binary classification, and there is no need for any form of transformation.
## Find observations with missing shot_made_flag values
test_index = kb[kb.shot_made_flag.isna()].index
## split between overall train and testing sets
train = kb.drop(index=test_index)
test = kb.loc[test_index]
## Break further into X and y for each
Xtr, ytr = train.drop(['shot_made_flag'], axis = 1), train.shot_made_flag
Xte, yte = test.drop(['shot_made_flag'], axis = 1), test.shot_made_flag
Now that we have our training and testing sets, we can safely perform some EDA.
Let's look at my prior concerns about potential irregularities in the shot_distance feature.
Remember that the lowest value was zero, which I was worried may have been a stand-in for a missing value. That said, it is more likely that these values represent dunks made by Kobe. We'll verify this below.
plt.figure(figsize=(12,8))
sns.histplot(Xtr.shot_distance, color = '#F3B21D')
plt.title("Distribution of Shot Distances");
It's clear that most of the shots occurred at the net. These include dunks, layups, and other close-range shots. Let's get a little bit of accounting done.
temp = Xtr[Xtr['shot_distance'] == 0]['combined_shot_type'].value_counts()
plt.figure(figsize = (12,8))
sns.barplot(x = temp.index, y = temp.values)
plt.title("Shots at the Net");
After looking at the types of shots taken at the net (i.e. with a distance of 0), you can tell that those values are genuine rather than missing. As surmised, layups and dunks account for the vast majority of the close shots, so they are definitely not stand-ins for missing values.
pal = sns.color_palette(cols)
plt.figure(figsize=(12,12))
sns.boxplot(x = np.where(ytr == 1, 'Score', 'Miss'), y = Xtr.shot_distance, palette=pal)
plt.ylabel("Shot Distance")
plt.title('Bivariate Analysis Between Shot Distance and Score Flag');
Knowing even a little bit about basketball, you could likely guess that a shot's distance influenced whether or not it would be a field goal or miss. This boxplot just confirms that relationship. Shots made closer to the net are far more likely to succeed than those attempted from farther away.
In addition, according to the data, shots attempted from more than 45 feet never succeeded.
Next up we have two categorical features; at first blush, action_type and combined_shot_type seem very similar, since both essentially describe the type of shot Kobe performed. They overlap heavily, so we need to see their relationship as well as how they collectively correlate with shot percentage. What I ended up finding is that both have problems, which are showcased below.
Let's look at both:
action_crosstab = pd.crosstab(index = Xtr.action_type, columns = np.where(ytr == 1, 'Score', 'Miss')).sort_values(by = 'Score', ascending = False)
temp = pd.melt(action_crosstab.reset_index(), id_vars= ['action_type'], value_vars = ['Score','Miss']).sort_values(by = ['value','action_type'], ascending = False).head(20)
plt.figure(figsize = (28,12))
sns.barplot(data = temp, x = 'action_type', y = 'value', hue = 'col_0')
plt.xlabel("Shot Type")
plt.ylabel("Count")
plt.title("Distribution of Observations Among Top Ten Action Types");
From this point onwards, the distribution has an even longer tail until we reach categories consisting of a single observation. This is, at the end of the day, a skewed distribution.
The problem with this attribute is that it has many rare categories, each of which becomes its own dimension when one-hot encoded for modeling. This would hurt performance and the predictive power of many models, so adding it as-is would be problematic.
combo_crosstab = pd.crosstab(index = Xtr.combined_shot_type, columns = np.where(ytr == 1, 'Score', 'Miss')).sort_values(by = 'Score', ascending = False)
temp = pd.melt(combo_crosstab.reset_index(), id_vars= ['combined_shot_type'], value_vars = ['Score','Miss']).sort_values(by = ['value','combined_shot_type'], ascending = False)
plt.figure(figsize = (15,8))
sns.barplot(data = temp, x = 'combined_shot_type', y = 'value', hue = 'col_0')
plt.title('Distribution of Observations Among Combined Shot Types');
As the name suggests, this attribute combines the categories of the prior setup.
Although it has fewer categories, it suffers from aggregating too much. We'll see below.
tempy = pd.concat((action_crosstab.iloc[1:3,:],pd.DataFrame(combo_crosstab.iloc[1,:]).T), axis = 0)
tempy.reset_index(inplace = True)
tempy = pd.melt(tempy, id_vars=['index'],value_vars=['Miss','Score'])
plt.figure(figsize=(8,8))
sns.barplot(data = tempy,x = 'index' ,y= 'value', hue='col_0')
plt.xlabel('Type of Layup')
plt.ylabel("Frequency")
plt.title("Scoring Percentages by Layup Shot Type");
Above you see two different styles of layup on the left and the generalized version on the right. The Driving Layup Shot and Normal Layup Shot are both heavily weighted towards either missing or scoring. The aggregated layup is less polarized. We lose information by replacing the two on the left with the one on the right.
We need to make our own version of the Combined Shot Type feature that avoids being too generalized or having that long-tail problem.
Step 1: Identify which action types are mapped to which combined shot types
Step 2: Find action types within a combined category holding more than 1% of the total observations.
Step 3: Pull them out as their own label and leave the rest as an 'other' category that contains rarer labels.
# Add a percentage of total observations for each action shot type
action_crosstab['Pct_of_Total'] = (action_crosstab.Score + action_crosstab.Miss)/len(Xtr)
# Based on those percentages, create a list of labels that hold over 1% of observations
one_percent = list(action_crosstab[action_crosstab.Pct_of_Total >= 0.01].index)
# Find the index of observations within the > 1% category
popular_index = Xtr.loc[Xtr.action_type.isin(one_percent)].index
# Find the index of observations within the <1% category
unpopular_index = Xtr.loc[~Xtr.action_type.isin(one_percent)].index
# Convert the jump shot name in the combined shot type feature into another name so there is less confusion
change_jump_name = Xtr.loc[unpopular_index]['combined_shot_type'].apply(lambda rowval: 'Classical Jump Shot' if rowval == 'Jump Shot' else rowval)
# Create our new feature
Xtr['one_percent_combined_shot'] = pd.concat((Xtr.loc[popular_index]['action_type'], change_jump_name), axis = 0)
one_percent_crosstab = pd.crosstab(index = Xtr['one_percent_combined_shot'], columns=np.where(ytr == 1, 'Score', 'Miss')).reset_index()
one_percent_crosstab = pd.melt(one_percent_crosstab, id_vars=['one_percent_combined_shot'], value_vars=['Miss','Score']).sort_values(by = ['value'], ascending= False)
plt.figure(figsize=(25,14))
sns.barplot(data = one_percent_crosstab,x = 'one_percent_combined_shot' ,y= 'value', hue='col_0')
plt.xlabel('Shot Type')
plt.ylabel("Frequency")
plt.title("Scoring Percentages by Shot Type");
Now we have a nice middle ground. There are 12 labels in this new feature rather than almost 70, and each label retains enough observations to be informative.
## Create array of Laker colors
cols = ['#FDB927','#552583']
## Transform into palette
pal = sns.color_palette(cols)
# Configure background image and figure size
plt.figure(figsize = (20,20))
img = plt.imread('bb_court.jpg')
plt.imshow(img, extent=[33.07,34.17,-118.59,-117.95])
plt.axis('off')
# plot data
sns.scatterplot(data = kb, x = 'lat', y = 'lon', hue = 'shot_type', alpha = 0.8, palette = pal);
pd.crosstab(index = Xtr.shot_type, columns = np.where(ytr == 1, 'Score', 'Miss')).sort_values(by = 'Score', ascending = False)
| col_0 | Miss | Score |
|---|---|---|
| shot_type | ||
| 2PT Field Goal | 10602 | 9683 |
| 3PT Field Goal | 3630 | 1782 |
I'm bringing back our prior image to showcase this feature.
Although there is a clear connection between the type of shot made and its chance of becoming a field goal, there is also a relationship between it and the shot distance showcased earlier. The two features are strongly correlated (essentially collinear), so shot_type will be dropped later on.
# Configure background image and figure size
plt.figure(figsize = (20,20))
img = plt.imread('bb_court.jpg')
plt.imshow(img, extent=[33.07,34.17,-118.59,-117.95])
plt.axis('off')
# plot data
sns.scatterplot(data = kb, x = 'lat', y = 'lon', hue = 'shot_zone_area', alpha = 0.8);
pd.crosstab(index = Xtr.shot_zone_area, columns = np.where(ytr == 1, 'Score', 'Miss')).sort_values(by = 'Score', ascending = False)
| col_0 | Miss | Score |
|---|---|---|
| shot_zone_area | ||
| Center(C) | 5356 | 5933 |
| Right Side(R) | 2309 | 1550 |
| Right Side Center(RC) | 2458 | 1523 |
| Left Side(L) | 1889 | 1243 |
| Left Side Center(LC) | 2149 | 1215 |
| Back Court(BC) | 71 | 1 |
Although this feature suffers from the same potential for covariance with shot distance that shot type did, this variable differentiates itself by incorporating the shot angle into the equation. That aspect is not naturally within this dataset.
For example, it's clear that Kobe managed greater success when shooting from the center rather than the left or right, despite the fact that they (mostly) were taken from comparable ranges.
We end up creating a shot angle feature later on, which covers the information relayed by this attribute, so we will keep this away from the model.
Although we've run through and selected (or disqualified) a few features, those decisions were pretty straightforward, and they were direct byproducts of my exploratory data analysis. In this next section, we will be looking through other attributes that either need to be altered or have a more complicated relationship with others in the set.
temp = pd.concat((Xtr[['season','game_date']], ytr), axis = 1)
temp.head()
| season | game_date | shot_made_flag | |
|---|---|---|---|
| 1 | 2000-01 | 2000-10-31 | 0.0 |
| 2 | 2000-01 | 2000-10-31 | 1.0 |
| 3 | 2000-01 | 2000-10-31 | 0.0 |
| 4 | 2000-01 | 2000-10-31 | 1.0 |
| 5 | 2000-01 | 2000-10-31 | 0.0 |
Here we have two features that represent a similar aspect of Kobe's performance, namely when in his career a shot was taken. season represents this as a categorical feature, while game_date does so more continuously.
Before investigating though, we'll be cleaning these two features by breaking them down into their parts.
temp['game_date'] = pd.to_datetime(temp['game_date'])
temp['score_pct'] = temp.groupby('game_date')['shot_made_flag'].transform('mean')
temp['shot_count'] = temp.groupby('game_date')['shot_made_flag'].transform('size')
temp['year'] = temp.game_date.dt.year
temp['month'] = temp.game_date.dt.month
temp.sort_values('game_date').head()
| season | game_date | shot_made_flag | score_pct | shot_count | year | month | |
|---|---|---|---|---|---|---|---|
| 22901 | 1996-97 | 1996-11-03 | 0.0 | 0.000000 | 1 | 1996 | 11 |
| 22902 | 1996-97 | 1996-11-05 | 0.0 | 0.000000 | 1 | 1996 | 11 |
| 22903 | 1996-97 | 1996-11-06 | 1.0 | 0.666667 | 3 | 1996 | 11 |
| 22904 | 1996-97 | 1996-11-06 | 0.0 | 0.666667 | 3 | 1996 | 11 |
| 22905 | 1996-97 | 1996-11-06 | 1.0 | 0.666667 | 3 | 1996 | 11 |
Now that everything has been disassembled, we'll treat this as a time series, meaning we will look for trends and seasonality. Two questions to keep in mind: did Kobe's accuracy trend up or down across years, and did it vary within a season from month to month?
fig, (ax, ax2) = plt.subplots(1,2, figsize = (25,10))
sns.set_palette(sns.color_palette("Paired"))
sns.violinplot(data = temp, x = 'year', y = 'score_pct', ax = ax)
sns.violinplot(data = temp, x = 'month', y = 'score_pct', ax = ax2)
plt.suptitle("Shot Accuracy Trend and Seasonality")
ax.title.set_text('Accuracy Trend')
ax2.title.set_text("Accuracy Seasonality");
This visualization provides some clear answers to our earlier questions:
Generally, there seems to be no seasonality. Kobe, professionally, did not care much whether or not it was November or March.
In short, this set of features can be boiled down to the predictive power of the season Kobe played in. Let's look closer at it.
temp = pd.crosstab(index = temp.year, columns = np.where(ytr == 1, 'Score', 'Miss'))
temp['Ratio'] = temp['Score']/(temp['Miss'] + temp['Score'])
temp.Ratio.describe()
count    21.000000
mean      0.433984
std       0.036569
min       0.348884
25%       0.430400
50%       0.441935
75%       0.460834
max       0.476932
Name: Ratio, dtype: float64
It seems that the late-90s and early 2000s were pretty good years for Kobe's accuracy. There's a decent range involved with this feature, but the deviation is not that high. We'll keep the season feature for the model, but the game date and the seasonality components can be dropped.
temp = pd.crosstab(index = Xtr.playoffs, columns = np.where(ytr == 1, 'Score', 'Miss')).sort_values(by = 'Score', ascending = False)
temp['Ratio'] = temp['Score']/(temp['Miss'] + temp['Score'])
temp
| col_0 | Miss | Score | Ratio |
|---|---|---|---|
| playoffs | |||
| 0 | 12145 | 9794 | 0.446420 |
| 1 | 2087 | 1671 | 0.444651 |
plt.figure(figsize = (8,8))
sns.barplot(x = temp.index, y = temp.Ratio)
plt.xlabel("Playoffs?")
plt.ylabel("Scoring Ratio")
plt.title("Did The Playoffs Matter? No They Did Not.");
Generally, Kobe didn't seem to be affected by playoff pressure. These results would push us towards getting rid of the feature as a predictive variable.
Our next features are all connected in that they describe the points during individual games that Kobe made or missed a shot. What's interesting about these features is that there is an argument to both combine them into one overall clock and not to.
For: Combining them all into one would allow us to track shots continuously throughout the game.
Against: This is not exactly how basketball works; shots are more likely to happen as the timer runs down, especially as the number of periods drags on. A continuous clock would falter when explaining hopeful shots made just before halftime and the like.
temp = Xtr[['minutes_remaining','period','seconds_remaining']].copy()  # copy to avoid SettingWithCopyWarning
temp.head()
| minutes_remaining | period | seconds_remaining | |
|---|---|---|---|
| 1 | 10 | 1 | 22 |
| 2 | 7 | 1 | 45 |
| 3 | 6 | 1 | 52 |
| 4 | 6 | 2 | 19 |
| 5 | 9 | 3 | 32 |
Continuous intervals interrupted by quarter starts and stops seems to be the best way forward. I will merge the minutes and seconds features into one but leave the quarter variable alone. In short, I will be treating this also like a time series.
temp['seconds_remaining_in_quarter'] = (pd.to_numeric(temp['minutes_remaining'])*60)+(pd.to_numeric(temp['seconds_remaining']))
temp = pd.concat((temp, ytr), axis = 1)
temp['score_pct'] = temp.groupby('minutes_remaining')['shot_made_flag'].transform('mean')
temp[['period','seconds_remaining_in_quarter','shot_made_flag','score_pct']].head()
| period | seconds_remaining_in_quarter | shot_made_flag | score_pct | |
|---|---|---|---|---|
| 1 | 1 | 622 | 0.0 | 0.454950 |
| 2 | 1 | 465 | 1.0 | 0.481876 |
| 3 | 1 | 412 | 0.0 | 0.469965 |
| 4 | 2 | 379 | 1.0 | 0.469965 |
| 5 | 3 | 572 | 0.0 | 0.479744 |
plt.figure(figsize = (7,7))
sns.boxplot(x = temp['period'], y = temp['score_pct'])
plt.title("Distribution of Shot Accuracy by Period");
There's a clear difference in shot quality by period, primarily between "normal" periods and irregular ones.
How does the minute marker affect performance?
temp['score_pct'] = temp.groupby('seconds_remaining')['shot_made_flag'].transform('mean')
plt.figure(figsize=(7,7))
sns.boxplot(x = temp['minutes_remaining'], y = temp['score_pct'])
plt.title("Distribution of Accuracy By Minutes Remaining");
We seem to have reasonably consistent performance across minutes throughout a quarter with one exception. As the last minute runs down, the accuracy suffers.
Remember that many shots were attempted in the last second; these would reasonably have a lower accuracy than their normal counterparts, so we should create a feature that differentiates the two.
last_second = Xtr[(Xtr['seconds_remaining'] == 0) & (Xtr['minutes_remaining'] == 0)]
last_second_shot = ytr.loc[last_second.index]
temp = pd.concat((last_second, last_second_shot), axis = 1)
second = temp.shot_made_flag.mean()
second
0.18036529680365296
The average scoring rate for shots taken in the last second is just under 20%. This is a clear drop in accuracy from the general average, but it represents only a small slice of the training data (a few hundred shots out of roughly 25,700).
What if we check how all shots in the final minute compare?
last_minute = Xtr[Xtr['minutes_remaining'] == 0]
last_minute_shot = ytr.loc[last_minute.index]
temp = pd.concat((last_minute, last_minute_shot), axis = 1)
minute = temp.shot_made_flag.mean()
minute
0.3805418719211823
The average scoring rate for shots made in the last minute is significantly higher than the last second but not as high as the general average. This makes intuitive sense since the last minute is a middle ground between normal play and a desperate shot at the last second.
plt.figure(figsize = (7,7))
sns.barplot(x = ['last_second', 'last_minute', 'average'], y = [second, minute, ytr.mean()])
plt.xlabel("When was the shot taken?")
plt.ylabel('Scoring Ratio')
plt.title("Comparison of Shot Accuracy By Time Left in Period");
It seems that both the last second and last minute shots would be beneficial in classifying a shot as a field goal or a miss, but they would likely have a fair degree of covariance. Let's see if it's a problem.
temp1 = pd.Series(np.where((Xtr['seconds_remaining'] == 0) & (Xtr['minutes_remaining'] == 0), 1, 0))
temp2 = pd.Series(np.where(Xtr['minutes_remaining'] == 0, 1, 0))
pd.concat((temp1, temp2), axis = 1).corr()
| 0 | 1 | |
|---|---|---|
| 0 | 1.000000 | 0.346194 |
| 1 | 0.346194 | 1.000000 |
With a correlation of 0.35, they have a little bit of covariance but not enough to be considered problematic alone. We'll add both features.
Next up, how does the game's location affect Kobe's shot performance? Does he have a home turf advantage?
temp = pd.concat((Xtr['matchup'], ytr), axis = 1)
temp['game_loc'] = np.where(Xtr['matchup'].str.contains('@'), 0, 1)
temp.head()
| matchup | shot_made_flag | game_loc | |
|---|---|---|---|
| 1 | LAL @ POR | 0.0 | 0 |
| 2 | LAL @ POR | 1.0 | 0 |
| 3 | LAL @ POR | 0.0 | 0 |
| 4 | LAL @ POR | 1.0 | 0 |
| 5 | LAL @ POR | 0.0 | 0 |
temp = pd.crosstab(temp['game_loc'], temp['shot_made_flag'])
temp['score_pct'] = temp.loc[:,1]/(temp.loc[:,1]+temp.loc[:,0])
temp
| shot_made_flag | 0.0 | 1.0 | score_pct |
|---|---|---|---|
| game_loc | |||
| 0 | 7446 | 5766 | 0.436421 |
| 1 | 6786 | 5699 | 0.456468 |
Surprisingly, there does not seem to be a meaningful difference between Kobe's performance away (43.6%) and at home (45.6%).
The quality of the opposing team is also a theoretical factor for the number of shots Kobe would make versus miss.
temp = pd.concat((Xtr, ytr), axis = 1)
temp['score_pct_by_game'] = temp.groupby('game_id')['shot_made_flag'].transform('mean')  # per-game accuracy (game_id, not the within-game event id)
temp['score_pct'] = temp.groupby('opponent')['shot_made_flag'].transform('mean')
temp[['opponent','score_pct']].drop_duplicates().sort_values('score_pct').head()
| opponent | score_pct | |
|---|---|---|
| 19587 | BKN | 0.400000 |
| 293 | IND | 0.400958 |
| 6279 | NOP | 0.407666 |
| 501 | MIL | 0.410256 |
| 2534 | BOS | 0.411239 |
plt.figure(figsize = (20,10))
sns.boxplot(x = temp['opponent'], y = temp['score_pct_by_game'])
plt.title("Distribution of Accuracy Based on Opponent");
According to the scoring percentage by opponent, there is no noticeable difference in Kobe's performance based on the team he faced; the per-game accuracy distributions show no clear relationship either.
We'll remove opponents as a predictive feature.
This was a late addition. After considering making do with the shot zone area, I decided to calculate the angle at which Kobe took each shot.
This visual may help inspire:
## Create array of Laker colors
cols = ['#FDB927','#552583']
## Transform into palette
pal = sns.color_palette(cols)
plt.figure(figsize = (8,8))
sns.scatterplot(data = Xtr, x = 'loc_x',y = 'loc_y', hue = ytr, palette = pal)
plt.xlabel("X Coordinate")
plt.ylabel("Y Coordinate");
After seeing this, it may come to mind that you can find the shot angle by utilizing three points: the hoop at the origin (0, 0), the shot location (loc_x, loc_y), and a reference point on the baseline at (250, 0).
Let's take one observation as an example:
plt.figure(figsize = (8,8))
sns.scatterplot(x = [0, -33, 250], y = [0, 125, 0])
sns.lineplot(x = [0, 250], y = [0,0])
sns.lineplot(x = [0,-33],y = [0,125])
plt.title("Visualizing The Three Points Of A Shot Taken At [{},{}]".format(Xtr.loc_x.iloc[7],Xtr.loc_y.iloc[7]));
Now that we have our vectors, let's calculate the angle between them: the cosine of the angle is the dot product of the two vectors divided by the product of their magnitudes.
def angles(val):
x = val[0]
y = val[1]
if (x == 0) & (y == 0):
y = 1
a = np.array([-x, y])
b = np.array([0,0])
c = np.array([250, 0])
ba = b - a
bc = c - b
cos = np.dot(ba, bc) / (np.linalg.norm(ba) * (np.linalg.norm(bc)))
angle = np.arccos(cos)
result = abs(np.degrees(angle)-90)
return result
angles((Xtr.loc_x.iloc[7], Xtr.loc_y.iloc[7])) + 90
104.78867761416518
It seems that the angle for the above observation was 104.79 degrees. There is a slight problem with the raw feature: its relationship with scoring is parabolic. You can see this below.
Xtr['combo'] = Xtr[['loc_x','loc_y']].values.tolist()
Xtr['shot_angle'] = Xtr['combo'].apply(angles)
Xtr['orig_shot_angle'] = Xtr['shot_angle'] + 90
# Configure background image and figure size
plt.figure(figsize = (20,20))
img = plt.imread('bb_court.jpg')
plt.imshow(img, extent=[33.07,34.17,-118.59,-117.95])
plt.axis('off')
# plot data
sns.scatterplot(data = Xtr, x = 'lat', y = 'lon', hue = 'orig_shot_angle', alpha = 0.8);
Kobe has the best chance of sinking a shot when he's directly in front of the net. In that position, he can take full advantage of the backboard. As the angle becomes more extreme (in either direction), the benefit decays.
We fixed this by adjusting the feature so that a straight-on shot is 0 and anything off-center is a higher number: we subtracted 90 from the angle and then took the absolute value. Shots from 90 degrees become 0, and those from 0 or 180 degrees become 90. In this way the relationship becomes negative and roughly linear.
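The folding step described above is just `abs(angle - 90)`; a minimal sketch verifying the endpoints:

```python
def fold_angle(raw_angle_degrees):
    """Map a raw baseline-relative shot angle (0-180) so that a
    straight-on shot (90 degrees) becomes 0 and the baseline
    extremes (0 or 180 degrees) become 90."""
    return abs(raw_angle_degrees - 90)

assert fold_angle(90) == 0    # directly in front of the net
assert fold_angle(0) == 90    # one baseline
assert fold_angle(180) == 90  # the other baseline
# mirror-image shots fold onto the same value
assert abs(fold_angle(104.79) - fold_angle(75.21)) < 1e-9
```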
temp = pd.concat((Xtr['shot_angle'], ytr), axis = 1)
temp['shot_made_flag'] = np.where(ytr == 1, 'Score', 'Miss')
plt.figure(figsize = (10,10))
sns.violinplot(data = temp, x = 'shot_made_flag', y = 'shot_angle')
plt.xlabel("Shot Result")
plt.ylabel("Shot Angle")
plt.title("The Effects of a Shot's Angle on Accuracy");
We now have our shot angle feature! As the violin plot shows, the lower the adjusted angle, the more likely Kobe was to score.
Shot Type: What type of shot style did Kobe use? (layup, dunk, hook shot?)
Period: What point of the game did he shoot?
Shot Zone: At what section of the court was Kobe situated when he shot?
Shot Distance: How far was he from the hoop?
Season: At what stage of his career did he make that shot?
Last Second Check: Was the shot made as the clock ran out?
Last Minute Check: Was it made when the clock was running out?
from sklearn.base import BaseEstimator, TransformerMixin
class GoldilocksShotType(BaseEstimator, TransformerMixin):
def __init__(self, goldi_int = True):
self.goldi_int = goldi_int
def fit(self, X, y = None):
return self
def transform(self, X):
        # Share of total observations held by each action shot type
        action_pcts = X['action_type'].value_counts(normalize = True)
        # Based on those shares, create a list of labels that hold over 1% of observations
        one_percent = list(action_pcts[action_pcts >= 0.01].index)
# Find the index of observations within the > 1% category
popular_index = X.loc[X.action_type.isin(one_percent)].index
# Find the index of observations within the <1% category
unpopular_index = X.loc[~X.action_type.isin(one_percent)].index
# Convert the jump shot name in the combined shot type feature into another name so there is less confusion
change_jump_name = X.loc[unpopular_index]['combined_shot_type'].apply(lambda rowval: 'Classical Jump Shot' if rowval == 'Jump Shot' else rowval)
# Create our new feature
X['one_percent_combined_shot'] = pd.concat((X.loc[popular_index]['action_type'], change_jump_name), axis = 0)
return X
class SeasonInteger(BaseEstimator, TransformerMixin):
    def __init__(self, season_int = True):
self.season_int = season_int
def fit(self, X, y = None):
return self
def transform(self, X):
season = pd.to_numeric(X.iloc[:,11].str.replace('-..', '', regex = True))
X.iloc[:,11] = season
return X
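A quick sanity check of the season conversion on some hypothetical season strings:

```python
import pandas as pd

# Season strings in the raw '<start>-<end>' format used by the dataset
seasons = pd.Series(['1996-97', '2005-06', '2015-16'])
# The '-..' regex drops the two-character suffix, leaving the starting year
season_start = pd.to_numeric(seasons.str.replace('-..', '', regex=True))
print(season_start.tolist())  # [1996, 2005, 2015]
```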
class LastMin(BaseEstimator, TransformerMixin):
    def __init__(self, make_vars=True):
        self.make_vars = make_vars
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # 'minutes_remaining' is column 8; flag shots taken in a period's final minute
        last_minute_shot = np.where(X.iloc[:, 8] == 0, 1, 0)
        X.iloc[:, 8] = last_minute_shot
        return X
class LastSec(BaseEstimator, TransformerMixin):
    def __init__(self, make_vars=True):
        self.make_vars = make_vars
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Flag shots with 0 minutes and 0 seconds remaining; this must run
        # before LastMin overwrites the raw minutes_remaining column
        last_second_shot = np.where((X.iloc[:, 8] == 0) & (X.iloc[:, 12] == 0), 1, 0)
        X.iloc[:, 12] = last_second_shot
        return X
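Both clock flags can be sanity-checked on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Made-up clock values within a period
clock = pd.DataFrame({'minutes_remaining': [5, 0, 0],
                      'seconds_remaining': [30, 42, 0]})

# Same logic as the transformers: last minute when minutes hit 0,
# last second when both minutes and seconds hit 0
last_minute = np.where(clock['minutes_remaining'] == 0, 1, 0)
last_second = np.where((clock['minutes_remaining'] == 0)
                       & (clock['seconds_remaining'] == 0), 1, 0)
print(last_minute.tolist(), last_second.tolist())  # [0, 1, 1] [0, 0, 1]
```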
class AngleShot(BaseEstimator, TransformerMixin):
    def __init__(self, make_vars=True):
        self.make_vars = make_vars
    @staticmethod
    def angles(val):
        x, y = val
        # A shot taken at the hoop has no defined angle; nudge it straight-on
        if (x == 0) & (y == 0):
            y = 1
        a = np.array([-x, y])
        b = np.array([0, 0])
        c = np.array([250, 0])
        ba = b - a
        bc = c - b
        cos = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
        angle = np.arccos(cos)
        # Fold around 90 so straight-on shots score 0 and baseline shots 90
        return abs(np.degrees(angle) - 90)
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Columns 5 and 6 are loc_x and loc_y
        X['combo'] = X.iloc[:, 5:7].values.tolist()
        X['shot_angle'] = X['combo'].apply(self.angles)
        return X
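To see what the geometry produces, here is a standalone sketch of the same computation (the helper name `shot_angle` is mine, not the dataset's):

```python
import numpy as np

def shot_angle(x, y):
    # Mirror of the angles() helper above: angle between the shot location
    # and the baseline, folded so straight-on = 0 and the baseline = 90
    if x == 0 and y == 0:
        y = 1
    ba = np.array([0, 0]) - np.array([-x, y])
    bc = np.array([250, 0]) - np.array([0, 0])
    cos = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    return abs(np.degrees(np.arccos(cos)) - 90)

print(shot_angle(0, 100))   # 0.0  -> shot from straight on
print(shot_angle(250, 0))   # 90.0 -> shot from the baseline corner
```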
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
cols = make_column_transformer(
(StandardScaler(),['shot_distance','period','season','shot_angle']),
(OneHotEncoder(), ['one_percent_combined_shot']),
('passthrough' , ['seconds_remaining','minutes_remaining'])
)
pipes = make_pipeline(
# Create new vars through pipeline
GoldilocksShotType(),
SeasonInteger(),
LastSec(),
LastMin(),
AngleShot(),
# Standardize and one hot
cols
)
Xtr = pipes.fit_transform(Xtr)
Now that everything is sorted, we'll begin modeling. Let's start by creating a function that fits a model of our choice and reports several evaluation metrics to gauge its predictive power.
from sklearn.model_selection import cross_val_score
def get_cross_val_scores(mod, Xtrain, ytrain):
    ## Stratified K-fold with 20 splits (sklearn stratifies by default for classifiers)
    accscores = cross_val_score(mod, Xtrain, ytrain, scoring='accuracy', cv=20)
    f1scores = cross_val_score(mod, Xtrain, ytrain, scoring='f1', cv=20)
    roc_auc = cross_val_score(mod, Xtrain, ytrain, scoring='roc_auc', cv=20)
    return [accscores, f1scores, roc_auc]
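As an aside, calling cross_val_score three times refits every model once per metric. If that becomes slow, sklearn's `cross_validate` can recover all three scores from a single set of fits; a sketch on a small synthetic problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Small synthetic binary problem, just to demonstrate the call
X, y = make_classification(n_samples=400, random_state=0)
res = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=5,
                     scoring=['accuracy', 'f1', 'roc_auc'])
# One fit per fold; all three metrics come from the same fitted models
print(res['test_accuracy'].mean(), res['test_f1'].mean(), res['test_roc_auc'].mean())
```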
def describe_accuracy(scores):
    print('Accuracy Scores: ', scores[0], '\n')
    print('Accuracy Mean Score: ', scores[0].mean())
    print('Accuracy Scores STD: ', scores[0].std(), '\n', '\n')
def describe_f1score(scores):
    print('F1 Scores: ', scores[1], '\n')
    print('F1 Mean Score: ', scores[1].mean())
    print('F1 Scores STD: ', scores[1].std(), '\n')
def describe_roc(scores):
    print('ROC_AUC Scores: ', scores[2], '\n')
    print('ROC_AUC Mean Score: ', scores[2].mean())
    print('ROC_AUC Scores STD: ', scores[2].std(), '\n')
def display_scores(scores):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
    sns.set_palette(sns.color_palette("Set2"))
    sns.histplot(scores[0], bins=10, ax=ax1)
    sns.histplot(scores[2], bins=10, ax=ax2)
    ax1.title.set_text('Accuracy Distribution')
    ax2.title.set_text("ROC/AUC Distribution")
    plt.suptitle("Distribution of Scores");
We picked accuracy as our headline metric because it is easy to compare across models. We'll also report the F1-score (which balances precision and recall) and ROC AUC, along with histograms of the accuracy and ROC AUC distributions across folds.
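Keep in mind that the competition itself is scored on log loss over the predicted probabilities, which punishes confidently wrong predictions especially hard. A quick illustration with `sklearn.metrics.log_loss` on made-up labels:

```python
from sklearn.metrics import log_loss

# Made-up labels and predicted positive-class probabilities
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.7, 0.6]
print(log_loss(y_true, y_prob))   # modest loss for reasonable probabilities

# A single confident miss (true class 1, predicted p=0.01) costs -ln(0.01) ~ 4.6
print(log_loss([1], [0.01], labels=[0, 1]))
```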
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
dum = DummyClassifier()
sc = get_cross_val_scores(dum, Xtr, ytr)
describe_accuracy(sc)
describe_f1score(sc)
describe_roc(sc)
Accuracy Scores: [0.492607 0.49494163 0.50894942 0.50194553 0.49805447 0.50116732 0.49571984 0.51361868 0.50972763 0.51439689 0.5151751 0.50505837 0.50116732 0.48949416 0.51595331 0.4770428 0.51361868 0.49610592 0.51713396 0.51635514]
Accuracy Mean Score: 0.5039116573936021
Accuracy Scores STD: 0.010676549819741088

F1 Scores: [0.44677138 0.46020761 0.44364292 0.45840407 0.46666667 0.47308032 0.44288225 0.44812554 0.45183291 0.44541485 0.4458042 0.44212359 0.40459364 0.43799472 0.42477876 0.43630017 0.42230347 0.43954104 0.4353042 0.43776824]
F1 Mean Score: 0.44317702771432144
F1 Scores STD: 0.015058240540642951

ROC_AUC Scores: [0.51428638 0.48259824 0.51469327 0.50147558 0.48149278 0.5086684 0.49885778 0.49006927 0.50477479 0.52937795 0.4947068 0.50366566 0.49615549 0.51691684 0.49745782 0.50399521 0.51109126 0.51107503 0.48118571 0.50567006]
ROC_AUC Mean Score: 0.502410715244629
ROC_AUC Scores STD: 0.012265488836936267
display_scores(sc)
As expected, the dummy model performed no better than chance. It does, however, give us a baseline against which to compare more complicated models.
dt = DecisionTreeClassifier()
sc = get_cross_val_scores(dt, Xtr, ytr)
describe_accuracy(sc)
describe_f1score(sc)
describe_roc(sc)
Accuracy Scores: [0.5618677 0.58287938 0.6077821 0.5922179 0.60311284 0.5844358 0.59143969 0.58910506 0.60311284 0.57276265 0.53929961 0.54552529 0.50817121 0.49805447 0.57354086 0.57042802 0.59455253 0.58878505 0.59813084 0.57165109]
Accuracy Mean Score: 0.5738427457968168
Accuracy Scores STD: 0.02949100600257557

F1 Scores: [0.52356021 0.55102041 0.55357143 0.5499154 0.55536332 0.52196837 0.53671329 0.5353095 0.57404326 0.52722558 0.45791855 0.43419789 0.375852 0.36714976 0.20899855 0.52233677 0.56202532 0.5539258 0.54792197 0.52686308]
F1 Mean Score: 0.4992940222208281
F1 Scores STD: 0.08803190898015543

ROC_AUC Scores: [0.57014628 0.58501235 0.59734396 0.57358521 0.57433158 0.57727293 0.57279105 0.582071 0.58735073 0.58640337 0.5393381 0.5290924 0.4941683 0.4738921 0.53148753 0.55798502 0.59701211 0.56611267 0.58819645 0.57420048]
ROC_AUC Mean Score: 0.5628896808621378
ROC_AUC Scores STD: 0.03247455275420586
display_scores(sc)
The decision tree performed markedly better than the baseline, but roughly 57% accuracy is not too exciting either. That said, we have more powerful models coming up.
kn = KNeighborsClassifier(n_neighbors=4)
sc = get_cross_val_scores(kn, Xtr, ytr)
describe_accuracy(sc)
describe_f1score(sc)
describe_roc(sc)
Accuracy Scores: [0.63735409 0.64124514 0.61867704 0.61789883 0.61089494 0.62879377 0.63346304 0.62801556 0.63891051 0.60856031 0.60933852 0.62256809 0.5844358 0.61089494 0.62412451 0.63579767 0.64824903 0.62538941 0.65654206 0.58489097]
Accuracy Mean Score: 0.6233022109894905
Accuracy Scores STD: 0.018116810112235544

F1 Scores: [0.49347826 0.53102747 0.4874477 0.49537513 0.4612069 0.47639956 0.51292658 0.4882227 0.536 0.48304214 0.47046414 0.48128342 0.41189427 0.45652174 0.39245283 0.47884187 0.54251012 0.48445874 0.54395036 0.42253521]
F1 Mean Score: 0.482501956300604
F1 Scores STD: 0.039926555655807855

ROC_AUC Scores: [0.65412304 0.65069146 0.62131841 0.61545287 0.63089863 0.64095682 0.63420275 0.63769683 0.64050949 0.62235156 0.60975033 0.62189688 0.58206898 0.630885 0.66122456 0.66729884 0.67370637 0.63940619 0.65833953 0.59476857]
ROC_AUC Mean Score: 0.6343773549586239
ROC_AUC Scores STD: 0.022873228275971016
display_scores(sc)
logit = LogisticRegression()
sc = get_cross_val_scores(logit, Xtr, ytr)
describe_accuracy(sc)
describe_f1score(sc)
describe_roc(sc)
Accuracy Scores: [0.68326848 0.69805447 0.70350195 0.67626459 0.70194553 0.67237354 0.68015564 0.70116732 0.69649805 0.67782101 0.68326848 0.67003891 0.63268482 0.68249027 0.64902724 0.68638132 0.68093385 0.68925234 0.68457944 0.6588785 ]
Accuracy Mean Score: 0.6804292883377577
Accuracy Scores STD: 0.017424797950843986

F1 Scores: [0.5452514 0.59583333 0.57525084 0.5602537 0.58142077 0.50989523 0.53768279 0.56950673 0.59375 0.58266129 0.6037001 0.57344064 0.5203252 0.61940299 0.56842105 0.55762898 0.54342984 0.57417289 0.57680251 0.52391304]
F1 Mean Score: 0.5656371656657866
F1 Scores STD: 0.027963137810312996

ROC_AUC Scores: [0.69373934 0.70566038 0.71175265 0.70501819 0.71511314 0.67859139 0.67399798 0.69990269 0.71612668 0.70981014 0.71235318 0.70419584 0.67076111 0.71406519 0.6656988 0.69329893 0.68695512 0.69841778 0.70231933 0.67500853]
ROC_AUC Mean Score: 0.6966393188236975
ROC_AUC Scores STD: 0.015771601697121242
display_scores(sc)
Another noticeable improvement. Let's see if we can do better.
bnb = BernoulliNB()
sc = get_cross_val_scores(bnb, Xtr, ytr)
describe_accuracy(sc)
describe_f1score(sc)
describe_roc(sc)
Accuracy Scores: [0.66692607 0.67315175 0.66381323 0.6459144 0.65603113 0.65525292 0.64435798 0.66459144 0.65758755 0.64980545 0.66225681 0.68093385 0.64202335 0.67859922 0.65136187 0.66225681 0.68404669 0.66121495 0.66043614 0.64096573]
Accuracy Mean Score: 0.6600763664133241
Accuracy Scores STD: 0.012210428265628376

F1 Scores: [0.5694165 0.61182994 0.56275304 0.59119497 0.58998145 0.54376931 0.54886476 0.5745311 0.59259259 0.59677419 0.61996497 0.61682243 0.57328386 0.6491079 0.5942029 0.59056604 0.59236948 0.58923513 0.59099437 0.55111977]
F1 Mean Score: 0.5874687344617111
F1 Scores STD: 0.025064467475764175

ROC_AUC Scores: [0.6678334 0.68812871 0.68511015 0.66466165 0.68927707 0.65505201 0.64603923 0.66965949 0.69012148 0.67886469 0.6897869 0.69614634 0.66107754 0.70925893 0.66148919 0.68528279 0.6878409 0.66783995 0.67390888 0.65026767]
ROC_AUC Mean Score: 0.6758823487713091
ROC_AUC Scores STD: 0.016248829161232933
display_scores(sc)
svc = SVC()
sc = get_cross_val_scores(svc, Xtr, ytr)
describe_accuracy(sc)
describe_f1score(sc)
describe_roc(sc)
Accuracy Scores: [0.68249027 0.69805447 0.70350195 0.67392996 0.70116732 0.67470817 0.68171206 0.70038911 0.69727626 0.6770428 0.67937743 0.67237354 0.62568093 0.69182879 0.65447471 0.68560311 0.68326848 0.68847352 0.68691589 0.65576324]
Accuracy Mean Score: 0.6807016012703493
Accuracy Scores STD: 0.01825257238650346

F1 Scores: [0.54464286 0.5966736 0.57713651 0.55848261 0.58169935 0.5162037 0.54199328 0.57079153 0.59521332 0.58291457 0.60536398 0.57688442 0.51068159 0.63535912 0.54414784 0.55895197 0.54827969 0.57356077 0.58125 0.51746725]
F1 Mean Score: 0.5658848978712055
F1 Scores STD: 0.030876667712166597

ROC_AUC Scores: [0.6897342 0.69837319 0.70917652 0.67164368 0.70562607 0.66099967 0.67479705 0.68951973 0.68235999 0.67579588 0.68166755 0.68969498 0.65216337 0.72206663 0.66560446 0.66574291 0.66949186 0.65991782 0.69761391 0.65111327]
ROC_AUC Mean Score: 0.6806551358454933
ROC_AUC Scores STD: 0.018991639139948408
display_scores(sc)
The SVM's predictive power was comparable to the logistic regression's.
rfc = RandomForestClassifier()
sc = get_cross_val_scores(rfc, Xtr, ytr)
describe_accuracy(sc)
describe_f1score(sc)
describe_roc(sc)
Accuracy Scores: [0.62879377 0.64124514 0.64669261 0.61789883 0.6381323 0.64046693 0.61712062 0.61867704 0.63579767 0.6155642 0.59299611 0.58599222 0.55564202 0.55642023 0.58832685 0.6155642 0.63968872 0.65109034 0.66510903 0.60903427]
Accuracy Mean Score: 0.6180126550056365
Accuracy Scores STD: 0.0289791577760127

F1 Scores: [0.55919854 0.57195572 0.56818182 0.56834532 0.57490637 0.5372549 0.54612546 0.56438356 0.59201389 0.55315315 0.50885368 0.42584746 0.41243523 0.46332046 0.17674419 0.51051625 0.57904085 0.59360731 0.61217075 0.53938832]
F1 Mean Score: 0.5228721623466192
F1 Scores STD: 0.09498450018868133

ROC_AUC Scores: [0.66030722 0.66423147 0.67544047 0.64763001 0.67664765 0.65552753 0.65565131 0.66653431 0.66669731 0.63715022 0.60445712 0.57817004 0.55348016 0.57392297 0.63326056 0.62457059 0.69080085 0.69188248 0.69286309 0.64510202]
ROC_AUC Mean Score: 0.6447163693237721
ROC_AUC Scores STD: 0.03901128220364295
display_scores(sc)
ada = AdaBoostClassifier()
sc = get_cross_val_scores(ada, Xtr, ytr)
describe_accuracy(sc)
describe_f1score(sc)
describe_roc(sc)
Accuracy Scores: [0.68171206 0.69727626 0.70272374 0.6770428 0.70038911 0.67392996 0.68093385 0.70116732 0.69727626 0.67782101 0.67859922 0.63891051 0.60622568 0.57431907 0.5618677 0.68638132 0.68326848 0.6923676 0.68613707 0.65809969]
Accuracy Mean Score: 0.6678224359673685
Accuracy Scores STD: 0.04002728130986336

F1 Scores: [0.54301676 0.59521332 0.57461024 0.56084656 0.58015267 0.51448436 0.53932584 0.5704698 0.59521332 0.58350101 0.59310345 0.49010989 0.39036145 0.43550052 0.05059022 0.55859803 0.54827969 0.58023379 0.57977059 0.52230686]
F1 Mean Score: 0.5202844180577163
F1 Scores STD: 0.11981646174387378

ROC_AUC Scores: [0.70119566 0.71620635 0.72734548 0.69969434 0.71166073 0.67632777 0.68712008 0.70389435 0.70225087 0.65602021 0.65267075 0.57831343 0.55496503 0.56388656 0.66687127 0.70092548 0.69490142 0.71644539 0.71379199 0.68214274]
ROC_AUC Mean Score: 0.6753314949830667
ROC_AUC Scores STD: 0.05008569695983133
display_scores(sc)
xg = XGBClassifier()
sc = get_cross_val_scores(xg, Xtr, ytr)
describe_accuracy(sc)
describe_f1score(sc)
describe_roc(sc)
Accuracy Scores: [0.66848249 0.68560311 0.68404669 0.65525292 0.69494163 0.66459144 0.64980545 0.6848249 0.65836576 0.62879377 0.59688716 0.5540856 0.5540856 0.48015564 0.57276265 0.62178988 0.64747082 0.67912773 0.6876947 0.63084112]
Accuracy Mean Score: 0.6349804538346849
Accuracy Scores STD: 0.055552136906050785

F1 Scores: [0.58070866 0.59760956 0.57883817 0.54470709 0.58386412 0.51950948 0.5302714 0.56684492 0.57747834 0.52725471 0.48814229 0.30545455 0.33138856 0.34122288 0.10440457 0.49056604 0.55890944 0.57613169 0.58010471 0.49681529]
F1 Mean Score: 0.4940113236643879
F1 Scores STD: 0.12389291223121576

ROC_AUC Scores: [0.6888714 0.69952889 0.70955767 0.67307391 0.69635714 0.65998735 0.64899283 0.68796326 0.66946095 0.62699522 0.58777477 0.49929898 0.50990287 0.49911054 0.63415247 0.6425092 0.68953038 0.69898724 0.71049796 0.66133043]
ROC_AUC Mean Score: 0.6446941730644363
ROC_AUC Scores STD: 0.06675145551388567
display_scores(sc)
from sklearn.model_selection import GridSearchCV
def grid_search_test(mod, paramgrid):
    grid_search = GridSearchCV(mod, paramgrid, scoring='accuracy', cv=20, return_train_score=True)
    grid_search.fit(Xtr, ytr)
    print('Best Estimator Parameters: ', grid_search.best_params_, '\n')
    print('Best Evaluation Score: ', grid_search.best_score_)
paramgrid = [{
'n_neighbors': [5, 6, 7], 'weights':['uniform','distance']
}]
grid_search_test(kn, paramgrid)
Best Estimator Parameters: {'n_neighbors': 6, 'weights': 'uniform'}
Best Evaluation Score: 0.6316299077542213
paramgrid = [{
'kernel':['linear', 'poly', 'rbf'],'gamma':['auto','scale']
}]
grid_search_test(svc, paramgrid)
Best Estimator Parameters: {'gamma': 'scale', 'kernel': 'poly'}
Best Evaluation Score: 0.8252588072018384
paramgrid = [{
'n_estimators':[300, 400, 500]
}]
grid_search_test(rfc, paramgrid)
Best Estimator Parameters: {'n_estimators': 500}
Best Evaluation Score: 0.6174682109652472
paramgrid = [{
'n_estimators':[50, 70, 90], 'learning_rate':[0.17, 0.33, 0.5]
}]
grid_search_test(ada, paramgrid)
Best Estimator Parameters: {'learning_rate': 0.33, 'n_estimators': 70}
Best Evaluation Score: 0.6764215365407227
paramgrid = [{
'C':[0.05, 0.17, 0.33, 0.5]  # C must be strictly positive for LogisticRegression
}]
grid_search_test(logit, paramgrid)
Best Estimator Parameters: {'C': 0.33}
Best Evaluation Score: 0.6806238408669406
from sklearn.ensemble import VotingClassifier
vtc = VotingClassifier(
estimators = [
('knn',KNeighborsClassifier(n_neighbors=6, weights='uniform')),
('logit', LogisticRegression(C = 0.33)),
('nb', BernoulliNB()),
('rf', RandomForestClassifier(n_estimators=500, max_features='sqrt')),
('ada', AdaBoostClassifier(n_estimators=70, learning_rate=0.33)),
('xgboost', XGBClassifier()),
('svm', SVC(kernel='poly', gamma='scale', probability=True))],
voting='soft'
)
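With voting='soft', the ensemble averages its members' predict_proba outputs (equal weights here) and predicts the argmax of the averaged probabilities. A minimal sketch on synthetic data (the two-member ensemble is just for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

X, y = make_classification(n_samples=200, random_state=1)
vote = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)), ('nb', BernoulliNB())],
    voting='soft').fit(X, y)

# The ensemble probability equals the mean of the fitted members' probabilities
member_mean = np.mean([est.predict_proba(X[:5]) for est in vote.estimators_], axis=0)
print(np.allclose(member_mean, vote.predict_proba(X[:5])))
```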
votscores = cross_val_score(vtc, Xtr, ytr, cv = 5, scoring = 'accuracy')
print('Accuracy Scores: ', votscores, '\n')
print('Accuracy Mean Score: ', votscores.mean())
print('Accuracy Scores STD: ', votscores.std(), '\n','\n')
Accuracy Scores: [0.68326848 0.67062257 0.6781475 0.66043977 0.67192061]
Accuracy Mean Score: 0.6728797862988681
Accuracy Scores STD: 0.007699906089555585
from sklearn.ensemble import StackingClassifier
stk = StackingClassifier(
estimators = [
('knn',KNeighborsClassifier(n_neighbors=6, weights='uniform')),
('logit', LogisticRegression(C = 0.33)),
('nb', BernoulliNB()),
('rf', RandomForestClassifier(n_estimators=500, max_features='sqrt')),
('ada', AdaBoostClassifier(n_estimators=70, learning_rate=0.33)),
('xgboost', XGBClassifier()),
('svm', SVC(kernel='rbf', gamma='scale'))
], final_estimator = LogisticRegression()
)
stkscores = cross_val_score(stk, Xtr, ytr, cv = 5, scoring = 'accuracy')
print('Accuracy Scores: ', stkscores, '\n')
print('Accuracy Mean Score: ', stkscores.mean())
print('Accuracy Scores STD: ', stkscores.std(), '\n','\n')
Accuracy Scores: [0.68599222 0.68424125 0.68379062 0.65460206 0.6800934 ]
Accuracy Mean Score: 0.6777439099644665
Accuracy Scores STD: 0.011729261489499541
class MyStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.standard = StandardScaler()
    def fit(self, x, y=None):
        # fit() must return self (not the inner scaler) to satisfy the
        # sklearn transformer contract used by pipelines
        self.standard.fit(x)
        return self
    def transform(self, x, y=None):
        return self.standard.transform(x)
class MyOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = OneHotEncoder()
    def fit(self, x, y=None):
        self.encoder.fit(x)
        return self
    def transform(self, x, y=None):
        return self.encoder.transform(x)
final_cols = make_column_transformer(
(MyStandardScaler(),['shot_distance','period','season','shot_angle']),
(MyOneHotEncoder(), ['one_percent_combined_shot']),
('passthrough' , ['seconds_remaining','minutes_remaining'])
)
pipe_test = make_pipeline(
GoldilocksShotType(),
SeasonInteger(),
LastSec(),
LastMin(),
AngleShot(),
# Standardize and one hot
final_cols,
vtc
)
pipe_test.fit(Xtr, ytr)
yhat = pipe_test.predict(Xte)
## Create array of Laker colors
cols = ['#FDB927','#552583']
## Transform into palette
pal = sns.color_palette(cols)
# Configure background image and figure size
plt.figure(figsize = (20,20))
img = plt.imread('bb_court.jpg')
plt.imshow(img, extent=[33.07,34.17,-118.59,-117.95])
plt.axis('off')
# plot data
sns.scatterplot(data = Xte, x = 'lat', y = 'lon', hue = yhat, alpha = 0.8, palette = pal);
The Kaggle competition requires submissions in the form of probabilities rather than hard predictions. Personally, I see the sections above as the real output, but we'll perform these final steps as well.
## Find observations with missing shot_made_flag values (the held-out test shots)
test_index = kb[kb.shot_made_flag.isna()].index
## split between overall train and testing sets
train = kb.drop(index=test_index)
test = kb.loc[test_index]
## Break further into X and y for each
Xtr, ytr = train.drop(['shot_made_flag'], axis = 1), train.shot_made_flag
Xte, yte = test.drop(['shot_made_flag'], axis = 1), test.shot_made_flag
yhat_pct = pipe_test.predict_proba(Xte)
num = list(Xte.shot_id)
shot_made_pct = pd.Series(yhat_pct[:,1])
result = pd.concat((pd.DataFrame(num), shot_made_pct), axis = 1)
result.columns = ['shot_id','shot_made_flag']
result.to_csv('./shot_pct_submission.csv', index = False)